The Iris dataset, also known as Fisher’s Iris or Anderson’s Iris, is a multivariate dataset introduced in 1936 by Ronald Fisher in his paper The use of multiple measurements in taxonomic problems as an example of the application of linear discriminant analysis. The data were collected by Edgar Anderson to quantify variations in the morphology of iris flowers of three species. Two of the three species were collected in the Gaspé Peninsula. “All are from the same field, picked on the same day and measured on the same day by the same person with the same measuring tools.”
The data set includes 50 samples of each of the three iris species (Iris setosa, Iris virginica and Iris versicolor). Four characteristics were measured from each sample: length and width of sepals and petals, in centimetres. Based on the combination of these four variables, Fisher developed a linear discriminant analysis model to distinguish between the species.
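Before the analysis proper, the packages used throughout can be loaded as follows (a sketch; the exact package versions are listed in the session info at the end of this document):

```r
# Packages assumed by the code below (sketch; see session info at the end)
library(tidyverse)    # dplyr, tidyr, ggplot2 pipes used throughout
library(tidymodels)   # parsnip, rsample, recipes, yardstick
library(skimr)        # skim()
library(gtsummary)    # tbl_summary()
library(corrplot)     # corrplot()
library(FactoMineR)   # PCA()
library(factoextra)   # fviz_nbclust()
library(patchwork)    # combining ggplots with + and /
library(plotly)       # ggplotly()

data(iris)  # shipped with base R (datasets package)
```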
In this project, I explore the Iris dataset and compare several models for predicting the species.
Skills developed: visualization, supervised machine learning, unsupervised machine learning.
iris %>%
select_if(is.numeric) %>%
gather() %>%
ggplot(aes(x=key, y=value, fill=key)) +
geom_boxplot() +
labs(title = "Boxplot of each numeric variable",
x = "variables")
skim(iris)
| Name | iris |
| Number of rows | 150 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Species | 0 | 1 | FALSE | 3 | set: 50, ver: 50, vir: 50 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Sepal.Length | 0 | 1 | 5.84 | 0.83 | 4.3 | 5.1 | 5.80 | 6.4 | 7.9 | ▆▇▇▅▂ |
| Sepal.Width | 0 | 1 | 3.06 | 0.44 | 2.0 | 2.8 | 3.00 | 3.3 | 4.4 | ▁▆▇▂▁ |
| Petal.Length | 0 | 1 | 3.76 | 1.77 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 | ▇▁▆▇▂ |
| Petal.Width | 0 | 1 | 1.20 | 0.76 | 0.1 | 0.3 | 1.30 | 1.8 | 2.5 | ▇▁▇▅▃ |
iris %>%
group_by(Species) %>%
skim()
| Name | Piped data |
| Number of rows | 150 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | Species |
Variable type: numeric
| skim_variable | Species | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sepal.Length | setosa | 0 | 1 | 5.01 | 0.35 | 4.3 | 4.80 | 5.00 | 5.20 | 5.8 | ▃▃▇▅▁ |
| Sepal.Length | versicolor | 0 | 1 | 5.94 | 0.52 | 4.9 | 5.60 | 5.90 | 6.30 | 7.0 | ▂▇▆▃▃ |
| Sepal.Length | virginica | 0 | 1 | 6.59 | 0.64 | 4.9 | 6.23 | 6.50 | 6.90 | 7.9 | ▁▃▇▃▂ |
| Sepal.Width | setosa | 0 | 1 | 3.43 | 0.38 | 2.3 | 3.20 | 3.40 | 3.68 | 4.4 | ▁▃▇▅▂ |
| Sepal.Width | versicolor | 0 | 1 | 2.77 | 0.31 | 2.0 | 2.52 | 2.80 | 3.00 | 3.4 | ▁▅▆▇▂ |
| Sepal.Width | virginica | 0 | 1 | 2.97 | 0.32 | 2.2 | 2.80 | 3.00 | 3.18 | 3.8 | ▂▆▇▅▁ |
| Petal.Length | setosa | 0 | 1 | 1.46 | 0.17 | 1.0 | 1.40 | 1.50 | 1.58 | 1.9 | ▁▃▇▃▁ |
| Petal.Length | versicolor | 0 | 1 | 4.26 | 0.47 | 3.0 | 4.00 | 4.35 | 4.60 | 5.1 | ▂▂▇▇▆ |
| Petal.Length | virginica | 0 | 1 | 5.55 | 0.55 | 4.5 | 5.10 | 5.55 | 5.88 | 6.9 | ▃▇▇▃▂ |
| Petal.Width | setosa | 0 | 1 | 0.25 | 0.11 | 0.1 | 0.20 | 0.20 | 0.30 | 0.6 | ▇▂▂▁▁ |
| Petal.Width | versicolor | 0 | 1 | 1.33 | 0.20 | 1.0 | 1.20 | 1.30 | 1.50 | 1.8 | ▅▇▃▆▁ |
| Petal.Width | virginica | 0 | 1 | 2.03 | 0.27 | 1.4 | 1.80 | 2.00 | 2.30 | 2.5 | ▂▇▆▅▇ |
iris %>%
tbl_summary(by = Species,
type = all_continuous() ~ "continuous2",
statistic = all_continuous() ~ c("{median} ({p25}-{p75})", "{mean} ({sd})")) %>%
add_overall(last = TRUE) %>%
add_stat_label()
| Characteristic | setosa, N = 50 | versicolor, N = 50 | virginica, N = 50 | Overall, N = 150 |
|---|---|---|---|---|
| Sepal.Length | ||||
| Median (25%-75%) | 5.00 (4.80-5.20) | 5.90 (5.60-6.30) | 6.50 (6.23-6.90) | 5.80 (5.10-6.40) |
| Mean (SD) | 5.01 (0.35) | 5.94 (0.52) | 6.59 (0.64) | 5.84 (0.83) |
| Sepal.Width | ||||
| Median (25%-75%) | 3.40 (3.20-3.68) | 2.80 (2.53-3.00) | 3.00 (2.80-3.18) | 3.00 (2.80-3.30) |
| Mean (SD) | 3.43 (0.38) | 2.77 (0.31) | 2.97 (0.32) | 3.06 (0.44) |
| Petal.Length | ||||
| Median (25%-75%) | 1.50 (1.40-1.58) | 4.35 (4.00-4.60) | 5.55 (5.10-5.88) | 4.35 (1.60-5.10) |
| Mean (SD) | 1.46 (0.17) | 4.26 (0.47) | 5.55 (0.55) | 3.76 (1.77) |
| Petal.Width | ||||
| Median (25%-75%) | 0.20 (0.20-0.30) | 1.30 (1.20-1.50) | 2.00 (1.80-2.30) | 1.30 (0.30-1.80) |
| Mean (SD) | 0.25 (0.11) | 1.33 (0.20) | 2.03 (0.27) | 1.20 (0.76) |
Important: only numeric variables can be used in PCA.
corrplot(round(cor(select_if(iris, is.numeric)),2),
type="upper",
order="hclust",
tl.col="black",
tl.srt=45)
res.pca <- PCA(select_if(iris, is.numeric), graph = FALSE)
plot(res.pca, choix = "var")
jpeg("images/pca_plot.jpg")
plot(res.pca, choix = "var")
dev.off()
## png
## 2
barplot(res.pca$eig[, 2], names.arg=1:nrow(res.pca$eig),
main = "Variances",
xlab = "Principal Components",
ylab = "Percentage of variances",
col ="steelblue")
# Add connected line segments to the plot
lines(x = 1:nrow(res.pca$eig), res.pca$eig[, 2],
type="b", pch=19, col = "red")
Interpretation: Sepal.Length, Petal.Width and Petal.Length are highly correlated: knowing one of the three gives a fairly good idea of the values of the other two.
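This claim can be checked directly against the same correlation matrix fed to corrplot (a quick sketch, not part of the original code):

```r
# Correlation matrix of the four numeric variables, rounded to 2 decimals
round(cor(select_if(iris, is.numeric)), 2)
# Petal.Length correlates strongly with Petal.Width and Sepal.Length,
# while Sepal.Width is only weakly (and negatively) related to the others
```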
res.h <- hclust(dist(iris), method = "complete")
plot(res.h, hang = -1, cex = 0.6)
jpeg("images/results_hclust.jpg")
plot(res.h, hang = -1, cex = 0.6)
dev.off()
## png
## 2
# table(iris$Species,
# cutree(res.h, k=3))
res.iris <- tibble(species = iris$Species,
hclust = cutree(res.h, k=3) ) %>%
mutate(hclust = case_when(hclust == 1 ~ "setosa",
hclust == 2 ~ "virginica",
hclust == 3 ~ "versicolor") )
Interpretation: with the hclust method, setosa and virginica are recognised almost all the time, while versicolor is confused with virginica about half the time.
iris_split <- initial_split(iris, prop = 0.7)
iris_train <- iris_split %>% training()
iris_test <- iris_split %>% testing()
nearest_neighbor_kknn_spec <-
nearest_neighbor() %>%
set_engine('kknn') %>%
set_mode('classification')
knn_mod <- nearest_neighbor_kknn_spec %>%
fit(Species ~ ., iris_train)
knn_last_fit <- last_fit(nearest_neighbor_kknn_spec, recipe(Species ~ ., data = iris), iris_split)
knn.a <- accuracy(cbind(iris, predict(knn_mod, iris)), Species, .pred_class)$.estimate
res.iris$knn <- predict(knn_mod, iris)$.pred_class
knn.plot <- ggplot(res.iris, aes(x=knn, fill=species)) +
geom_bar() +
labs(title = "Knn model",
x="Species from KNN model")
ggplotly(knn.plot)
fviz_nbclust(select_if(iris, is.numeric), kmeans, method = "wss")
res.km <- kmeans(select_if(iris, is.numeric), centers = 3, nstart = 25)$cluster
res.km <- as.factor( ifelse(res.km == 1, "virginica", ifelse(res.km == 2, "versicolor", "setosa") ) )
# table(iris$Species, res.km)
km.a <- accuracy(cbind(iris, res.km), Species, res.km)$.estimate
res.iris$km <- res.km
km.plot <- ggplot(res.iris, aes(x=km, fill=species)) +
geom_bar() +
labs(title = "Kmeans model",
x="Species from kmeans model")
ggplotly(km.plot)
xgboost_parnsnip <-
boost_tree() %>%
set_engine('xgboost') %>%
set_mode('classification')
res.xgboost <- xgboost_parnsnip %>%
fit(Species ~ ., data = iris) %>%
predict(iris) %>%
pull(.pred_class)
xgboost.a <- accuracy(cbind(iris, res.xgboost), Species, res.xgboost)$.estimate
res.iris$xgboost <- res.xgboost
xgboost.plot <- ggplot(res.iris, aes(x=xgboost, fill=species)) +
geom_bar() +
labs(title = "Xgboost model",
x="Species from Xgboost model")
ggplotly(xgboost.plot)
ranger_parnsnip <-
rand_forest() %>%
set_engine('ranger') %>%
set_mode('classification')
res.ranger <- ranger_parnsnip %>%
fit(Species ~ ., data = iris) %>%
predict(iris) %>%
pull(.pred_class)
ranger.a <- accuracy(cbind(iris, res.ranger), Species, res.ranger)$.estimate
res.iris$ranger <- res.ranger
ranger.plot <- ggplot(res.iris, aes(x=ranger, fill=species)) +
geom_bar() +
labs(title = "Ranger model",
x="Species from Ranger model")
ggplotly(ranger.plot)
The objective is now to compare the four models in two ways: graphically, and through the accuracy of each model.
(knn.plot + km.plot)/ (xgboost.plot + ranger.plot)
The accuracy table:
data.frame(model = c("knn", "kmeans", "xgboost", "ranger"),
accuracy = c(knn.a, km.a, xgboost.a, ranger.a))
## model accuracy
## 1 knn 0.9733333
## 2 kmeans 0.3200000
## 3 xgboost 1.0000000
## 4 ranger 0.9800000
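An aside (not part of the original analysis): k-means cluster IDs are arbitrary, so the fixed ifelse() mapping used above can assign the wrong species label to a cluster and deflate the accuracy. A contingency table lets each cluster take its majority species instead:

```r
# Cross-tabulate raw k-means cluster IDs against the true species,
# then map each cluster to its most frequent species (majority vote)
km_raw  <- kmeans(select_if(iris, is.numeric), centers = 3, nstart = 25)$cluster
tab     <- table(cluster = km_raw, species = iris$Species)
mapping <- colnames(tab)[apply(tab, 1, which.max)]  # best species per cluster
res.km.best <- factor(mapping[km_raw], levels = levels(iris$Species))
mean(res.km.best == iris$Species)  # usually far above the 0.32 reported here
```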
Analysing the results with PCA:
res.pca.res <- PCA(res.iris %>%
mutate(species = case_when(species == "setosa" ~ 1,
species == "virginica" ~ 2,
species == "versicolor" ~ 3),
hclust = case_when(hclust == "setosa" ~ 1,
hclust == "virginica" ~ 2,
hclust == "versicolor" ~ 3),
knn = case_when(knn == "setosa" ~ 1,
knn == "virginica" ~ 2,
knn == "versicolor" ~ 3),
km = case_when(km == "setosa" ~ 1,
km == "virginica" ~ 2,
km == "versicolor" ~ 3),
xgboost = case_when(xgboost == "setosa" ~ 1,
xgboost == "virginica" ~ 2,
xgboost == "versicolor" ~ 3),
ranger = case_when(ranger == "setosa" ~ 1,
ranger == "virginica" ~ 2,
ranger == "versicolor" ~ 3)), graph = FALSE)
plot(res.pca.res, choix = "var")
iris_pca <- PCA(iris %>% select(-Species),
ncp = 3,
graph = FALSE)$ind$coord %>%
as_tibble() %>%
mutate(Species = case_when(iris$Species == "setosa" ~ 1,
iris$Species == "virginica" ~ 2,
iris$Species == "versicolor" ~ 3),
Species = as_factor(Species))
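Before reusing these coordinates, it can be worth checking how much information the three retained components keep (a hypothetical sanity check, not in the original code):

```r
# Cumulative variance retained by the principal components (sketch)
pca_fit <- PCA(iris %>% select(-Species), ncp = 3, graph = FALSE)
pca_fit$eig[, "cumulative percentage of variance"]
# the first three components retain almost all of the total variance
```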
res.h <- hclust(dist(iris_pca), method = "complete")
plot(res.h, hang = -1, cex = 0.6)
# table(iris$Species,
# cutree(res.h, k=3))
res.iris <- tibble(species = iris$Species,
hclust = cutree(res.h, k=3) ) %>%
mutate(hclust = case_when(hclust == 1 ~ "setosa",
hclust == 2 ~ "virginica",
hclust == 3 ~ "versicolor") )
iris_pca_split <- initial_split(iris_pca, prop = 0.7)
iris_pca_train <- iris_pca_split %>% training()
iris_pca_test <- iris_pca_split %>% testing()
nearest_neighbor_kknn_spec <-
nearest_neighbor() %>%
set_engine('kknn') %>%
set_mode('classification')
knn_mod <- nearest_neighbor_kknn_spec %>%
fit(Species ~ ., iris_pca_train)
knn_last_fit <- last_fit(nearest_neighbor_kknn_spec, recipe(Species ~ ., data = iris_pca), iris_pca_split)
knn.a.pca <- accuracy(cbind(iris_pca, predict(knn_mod, iris_pca)), Species, .pred_class)$.estimate
res.iris$knn <- predict(knn_mod, iris_pca)$.pred_class
knn.plot.pca <- ggplot(res.iris, aes(x=knn, fill=species)) +
geom_bar() +
labs(title = "Knn model",
x="Species from KNN model")
ggplotly(knn.plot.pca)
fviz_nbclust(select_if(iris_pca, is.numeric), kmeans, method = "wss")
res.km <- kmeans(select_if(iris_pca, is.numeric), centers = 3, nstart = 25)$cluster
res.km <- as.factor( ifelse(res.km == 1, "virginica", ifelse(res.km == 2, "versicolor", "setosa") ) )
# table(iris$Species, res.km)
km.a.pca <- accuracy(cbind(iris, res.km), Species, res.km)$.estimate
res.iris$km <- res.km
km.plot.pca <- ggplot(res.iris, aes(x=km, fill=species)) +
geom_bar() +
labs(title = "Kmeans model",
x="Species from kmeans model")
ggplotly(km.plot.pca)
xgboost_parnsnip <-
boost_tree() %>%
set_engine('xgboost') %>%
set_mode('classification')
res.xgboost <- xgboost_parnsnip %>%
fit(Species ~ ., data = iris_pca) %>%
predict(iris_pca) %>%
pull(.pred_class)
xgboost.a.pca <- accuracy(cbind(iris_pca, res.xgboost), Species, res.xgboost)$.estimate
res.iris$xgboost <- res.xgboost
xgboost.plot.pca <- ggplot(res.iris, aes(x=xgboost, fill=species)) +
geom_bar() +
labs(title = "Xgboost model",
x="Species from Xgboost model")
ggplotly(xgboost.plot.pca)
ranger_parnsnip <-
rand_forest() %>%
set_engine('ranger') %>%
set_mode('classification')
res.ranger <- ranger_parnsnip %>%
fit(Species ~ ., data = iris_pca) %>%
predict(iris_pca) %>%
pull(.pred_class)
ranger.a.pca <- accuracy(cbind(iris_pca, res.ranger), Species, res.ranger)$.estimate
res.iris$ranger <- res.ranger
ranger.plot.pca <- ggplot(res.iris, aes(x=ranger, fill=species)) +
geom_bar() +
labs(title = "Ranger model",
x="Species from Ranger model")
ggplotly(ranger.plot.pca)
The objective is now to compare the four models in two ways: graphically, and through the accuracy of each model.
(knn.plot.pca + km.plot.pca)/ (xgboost.plot.pca + ranger.plot.pca)
The accuracy table:
data.frame(model = c("knn", "kmeans", "xgboost", "ranger"),
accuracy = c(knn.a, km.a, xgboost.a, ranger.a),
accuracy_PCA = c(knn.a.pca, km.a.pca, xgboost.a.pca, ranger.a.pca))
## model accuracy accuracy_PCA
## 1 knn 0.9733333 0.9466667
## 2 kmeans 0.3200000 0.2600000
## 3 xgboost 1.0000000 1.0000000
## 4 ranger 0.9800000 1.0000000
iris_scaled <- scale(iris %>% select(-Species), center = FALSE, scale = TRUE) %>%
as_tibble() %>%
mutate(Species = iris$Species,
Species = case_when(iris$Species == "setosa" ~ 1,
iris$Species == "virginica" ~ 2,
iris$Species == "versicolor" ~ 3),
Species = as_factor(Species))
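A side note (not in the original analysis): with center = FALSE, scale() divides each column by its root mean square rather than its standard deviation (this is documented in ?scale), so this “scaled” variant is not a classic standardisation:

```r
# With center = FALSE, scale() divides by sqrt(sum(x^2) / (n - 1)),
# the root mean square, not by sd(x) (see ?scale)
x   <- iris$Sepal.Length
rms <- sqrt(sum(x^2) / (length(x) - 1))
all.equal(as.numeric(scale(x, center = FALSE)[, 1]), x / rms)  # TRUE
```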
res.h <- hclust(dist(iris_scaled), method = "complete")
plot(res.h, hang = -1, cex = 0.6)
# table(iris$Species,
# cutree(res.h, k=3))
res.iris <- tibble(species = iris$Species,
hclust = cutree(res.h, k=3) ) %>%
mutate(hclust = case_when(hclust == 1 ~ "setosa",
hclust == 2 ~ "virginica",
hclust == 3 ~ "versicolor") )
iris_scaled_split <- initial_split(iris_scaled, prop = 0.7)
iris_scaled_train <- iris_scaled_split %>% training()
iris_scaled_test <- iris_scaled_split %>% testing()
nearest_neighbor_kknn_spec <-
nearest_neighbor() %>%
set_engine('kknn') %>%
set_mode('classification')
knn_mod <- nearest_neighbor_kknn_spec %>%
fit(Species ~ ., iris_scaled_train)
knn_last_fit <- last_fit(nearest_neighbor_kknn_spec, recipe(Species ~ ., data = iris_scaled), iris_scaled_split)
knn.a.scaled <- accuracy(cbind(iris_scaled, predict(knn_mod, iris_scaled)), Species, .pred_class)$.estimate
res.iris$knn <- predict(knn_mod, iris_scaled)$.pred_class
knn.plot.scaled <- ggplot(res.iris, aes(x=knn, fill=species)) +
geom_bar() +
labs(title = "Knn model",
x="Species from KNN model")
ggplotly(knn.plot.scaled)
fviz_nbclust(select_if(iris_scaled, is.numeric), kmeans, method = "wss")
res.km <- kmeans(select_if(iris_scaled, is.numeric), centers = 3, nstart = 25)$cluster
res.km <- as.factor( ifelse(res.km == 1, "virginica", ifelse(res.km == 2, "versicolor", "setosa") ) )
# table(iris$Species, res.km)
km.a.scaled <- accuracy(cbind(iris, res.km), Species, res.km)$.estimate
res.iris$km <- res.km
km.plot.scaled <- ggplot(res.iris, aes(x=km, fill=species)) +
geom_bar() +
labs(title = "Kmeans model",
x="Species from kmeans model")
ggplotly(km.plot.scaled)
xgboost_parnsnip <-
boost_tree() %>%
set_engine('xgboost') %>%
set_mode('classification')
res.xgboost <- xgboost_parnsnip %>%
fit(Species ~ ., data = iris_scaled) %>%
predict(iris_scaled) %>%
pull(.pred_class)
xgboost.a.scaled <- accuracy(cbind(iris_scaled, res.xgboost), Species, res.xgboost)$.estimate
res.iris$xgboost <- res.xgboost
xgboost.plot.scaled <- ggplot(res.iris, aes(x=xgboost, fill=species)) +
geom_bar() +
labs(title = "Xgboost model",
x="Species from Xgboost model")
ggplotly(xgboost.plot.scaled)
ranger_parnsnip <-
rand_forest() %>%
set_engine('ranger') %>%
set_mode('classification')
res.ranger <- ranger_parnsnip %>%
fit(Species ~ ., data = iris_scaled) %>%
predict(iris_scaled) %>%
pull(.pred_class)
ranger.a.scaled <- accuracy(cbind(iris_scaled, res.ranger), Species, res.ranger)$.estimate
res.iris$ranger <- res.ranger
ranger.plot.scaled <- ggplot(res.iris, aes(x=ranger, fill=species)) +
geom_bar() +
labs(title = "Ranger model",
x="Species from Ranger model")
ggplotly(ranger.plot.scaled)
The objective is now to compare the four models in two ways: graphically, and through the accuracy of each model. In the grid below, KNN and k-means are on top, XGBoost and ranger on the bottom.
(knn.plot.scaled + km.plot.scaled)/ (xgboost.plot.scaled + ranger.plot.scaled)
The accuracy table:
data.frame(model = c("knn", "kmeans", "xgboost", "ranger"),
accuracy = c(knn.a, km.a, xgboost.a, ranger.a),
accuracy_PCA = c(knn.a.pca, km.a.pca, xgboost.a.pca, ranger.a.pca),
accuracy_scaled = c(knn.a.scaled, km.a.scaled, xgboost.a.scaled, ranger.a.scaled))
## model accuracy accuracy_PCA accuracy_scaled
## 1 knn 0.9733333 0.9466667 0.96
## 2 kmeans 0.3200000 0.2600000 0.32
## 3 xgboost 1.0000000 1.0000000 1.00
## 4 ranger 0.9800000 1.0000000 0.98
iris_centered <- scale(iris %>% select(-Species), center = TRUE, scale = TRUE) %>%
as_tibble() %>%
mutate(Species = iris$Species,
Species = case_when(iris$Species == "setosa" ~ 1,
iris$Species == "virginica" ~ 2,
iris$Species == "versicolor" ~ 3),
Species = as_factor(Species))
res.h <- hclust(dist(iris_centered), method = "complete")
plot(res.h, hang = -1, cex = 0.6)
# table(iris$Species,
# cutree(res.h, k=3))
res.iris <- tibble(species = iris$Species,
hclust = cutree(res.h, k=3) ) %>%
mutate(hclust = case_when(hclust == 1 ~ "setosa",
hclust == 2 ~ "virginica",
hclust == 3 ~ "versicolor") )
iris_centered_split <- initial_split(iris_centered, prop = 0.7)
iris_centered_train <- iris_centered_split %>% training()
iris_centered_test <- iris_centered_split %>% testing()
nearest_neighbor_kknn_spec <-
nearest_neighbor() %>%
set_engine('kknn') %>%
set_mode('classification')
knn_mod <- nearest_neighbor_kknn_spec %>%
fit(Species ~ ., iris_centered_train)
knn_last_fit <- last_fit(nearest_neighbor_kknn_spec, recipe(Species ~ ., data = iris_centered), iris_centered_split)
knn.a.centered <- accuracy(cbind(iris_centered, predict(knn_mod, iris_centered)), Species, .pred_class)$.estimate
res.iris$knn <- predict(knn_mod, iris_centered)$.pred_class
knn.plot.centered <- ggplot(res.iris, aes(x=knn, fill=species)) +
geom_bar() +
labs(title = "Knn model",
x="Species from KNN model")
ggplotly(knn.plot.centered)
fviz_nbclust(select_if(iris_centered, is.numeric), kmeans, method = "wss")
res.km <- kmeans(select_if(iris_centered, is.numeric), centers = 3, nstart = 25)$cluster
res.km <- as.factor( ifelse(res.km == 1, "virginica", ifelse(res.km == 2, "versicolor", "setosa") ) )
# table(iris$Species, res.km)
km.a.centered <- accuracy(cbind(iris, res.km), Species, res.km)$.estimate
res.iris$km <- res.km
km.plot.centered <- ggplot(res.iris, aes(x=km, fill=species)) +
geom_bar() +
labs(title = "Kmeans model",
x="Species from kmeans model")
ggplotly(km.plot.centered)
xgboost_parnsnip <-
boost_tree() %>%
set_engine('xgboost') %>%
set_mode('classification')
res.xgboost <- xgboost_parnsnip %>%
fit(Species ~ ., data = iris_centered) %>%
predict(iris_centered) %>%
pull(.pred_class)
xgboost.a.centered <- accuracy(cbind(iris_centered, res.xgboost), Species, res.xgboost)$.estimate
res.iris$xgboost <- res.xgboost
xgboost.plot.centered <- ggplot(res.iris, aes(x=xgboost, fill=species)) +
geom_bar() +
labs(title = "Xgboost model",
x="Species from Xgboost model")
ggplotly(xgboost.plot.centered)
ranger_parnsnip <-
rand_forest() %>%
set_engine('ranger') %>%
set_mode('classification')
res.ranger <- ranger_parnsnip %>%
fit(Species ~ ., data = iris_centered) %>%
predict(iris_centered) %>%
pull(.pred_class)
ranger.a.centered <- accuracy(cbind(iris_centered, res.ranger), Species, res.ranger)$.estimate
res.iris$ranger <- res.ranger
ranger.plot.centered <- ggplot(res.iris, aes(x=ranger, fill=species)) +
geom_bar() +
labs(title = "Ranger model",
x="Species from Ranger model")
ggplotly(ranger.plot.centered)
The objective is now to compare the four models in two ways: graphically, and through the accuracy of each model. In the grid below, KNN and k-means are on top, XGBoost and ranger on the bottom.
(knn.plot.centered + km.plot.centered)/ (xgboost.plot.centered + ranger.plot.centered)
The accuracy table:
data.frame(model = c("knn", "kmeans", "xgboost", "ranger"),
accuracy = c(knn.a, km.a, xgboost.a, ranger.a),
accuracy_PCA = c(knn.a.pca, km.a.pca, xgboost.a.pca, ranger.a.pca),
accuracy_scaled = c(knn.a.scaled, km.a.scaled, xgboost.a.scaled, ranger.a.scaled),
accuracy_centered = c(knn.a.centered, km.a.centered, xgboost.a.centered, ranger.a.centered))
## model accuracy accuracy_PCA accuracy_scaled accuracy_centered
## 1 knn 0.9733333 0.9466667 0.96 0.96
## 2 kmeans 0.3200000 0.2600000 0.32 0.26
## 3 xgboost 1.0000000 1.0000000 1.00 1.00
## 4 ranger 0.9800000 1.0000000 0.98 0.98
For the iris dataset, the best model for predicting the species is the XGBoost model, regardless of the transformation tested. The k-means model (with the fixed cluster-to-species mapping used here) is a very poor predictor of the species. The transformations had the largest impact on k-means, reducing its accuracy; kNN and ranger shifted only slightly (PCA lowered kNN’s accuracy and raised ranger’s to 1.00).
print( paste0( "System version : ", sessionInfo()$running, ", ", sessionInfo()$platform) )
## [1] "System version : Windows 10 x64 (build 19045), x86_64-w64-mingw32/x64 (64-bit)"
print( paste0( R.version$version.string, " - ", R.version$nickname ) )
## [1] "R version 4.2.0 (2022-04-22 ucrt) - Vigorous Calisthenics"
for (package in c( sessionInfo()$basePkgs, objects(sessionInfo()$otherPkgs) ) ) {
print( paste0( package, " : ", package, "_", packageVersion(package) ) ) }
## [1] "stats : stats_4.2.0"
## [1] "graphics : graphics_4.2.0"
## [1] "grDevices : grDevices_4.2.0"
## [1] "utils : utils_4.2.0"
## [1] "datasets : datasets_4.2.0"
## [1] "methods : methods_4.2.0"
## [1] "base : base_4.2.0"
## [1] "broom : broom_1.0.1"
## [1] "corrplot : corrplot_0.92"
## [1] "dials : dials_1.1.0"
## [1] "dplyr : dplyr_1.0.10"
## [1] "factoextra : factoextra_1.0.7"
## [1] "FactoMineR : FactoMineR_2.6"
## [1] "forcats : forcats_0.5.2"
## [1] "ggplot2 : ggplot2_3.4.0"
## [1] "gtsummary : gtsummary_1.6.3"
## [1] "infer : infer_1.0.4"
## [1] "kknn : kknn_1.3.1"
## [1] "modeldata : modeldata_1.0.1"
## [1] "parsnip : parsnip_1.0.3"
## [1] "patchwork : patchwork_1.1.2"
## [1] "plotly : plotly_4.10.1"
## [1] "purrr : purrr_0.3.5"
## [1] "randomForest : randomForest_4.7.1.1"
## [1] "ranger : ranger_0.14.1"
## [1] "readr : readr_2.1.3"
## [1] "recipes : recipes_1.0.3"
## [1] "rsample : rsample_1.1.1"
## [1] "scales : scales_1.2.1"
## [1] "skimr : skimr_2.1.4"
## [1] "stringr : stringr_1.5.0"
## [1] "tibble : tibble_3.1.8"
## [1] "tidymodels : tidymodels_1.0.0"
## [1] "tidyr : tidyr_1.2.1"
## [1] "tidyverse : tidyverse_1.3.2"
## [1] "tune : tune_1.0.1"
## [1] "workflows : workflows_1.1.2"
## [1] "workflowsets : workflowsets_1.0.0"
## [1] "yardstick : yardstick_1.1.0"